Pandas: Dealing with Missing Data
Import Libraries
Import Pandas and NumPy, which are essential for data manipulation and handling missing values.
import pandas as pd
import numpy as np
What is np.nan?
np.nan represents a missing value (“Not a Number”) in NumPy and Pandas.
np.nan
nan
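One property of np.nan that is easy to miss is that it never compares equal to anything, including itself, so an equality test cannot find it. The following is a minimal sketch of that behavior; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is an ordinary Python float that is not equal to itself,
# so `==` cannot be used to detect it.
is_equal_to_itself = value == value
is_float = isinstance(value, float)

# pd.isna() is a reliable check for missing values.
detected = pd.isna(value)

print(is_equal_to_itself, is_float, detected)
```

This is why the sections that follow rely on isnull() rather than comparisons against np.nan.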
Create DataFrame with Missing Values
This section creates a DataFrame containing missing values using np.nan.
data = {'A': [1,2,np.nan,4,5],
'B': [6,np.nan,7,8,9],
'C': [11,12,13,np.nan,15]
}
df = pd.DataFrame(data)
df
| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 6.0 | 11.0 |
| 1 | 2.0 | NaN | 12.0 |
| 2 | NaN | 7.0 | 13.0 |
| 3 | 4.0 | 8.0 | NaN |
| 4 | 5.0 | 9.0 | 15.0 |
Detect Missing Values
Use isnull() to check which values are missing in the DataFrame.
df.isnull()
| | A | B | C |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |
| 2 | True | False | False |
| 3 | False | False | True |
| 4 | False | False | False |
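The boolean mask can also be used to pull out the affected rows. The sketch below assumes the same data as above, rebuilt under the illustrative name df_demo, and also checks that isna() and isnull() give the same result.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 7, 8, 9],
    'C': [11, 12, 13, np.nan, 15],
})

# isna() and isnull() are two names for the same check.
same_result = df_demo.isna().equals(df_demo.isnull())

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]

print(same_result)
print(rows_with_missing)
```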
Count Missing Values
Use isnull().sum() to count the number of missing values in each column.
df.isnull().sum()
A 1
B 1
C 1
dtype: int64
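The per-column counts can be extended to a total and a percentage. This is a small sketch under the assumption that the same data is used; df_demo and the result names are illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 7, 8, 9],
    'C': [11, 12, 13, np.nan, 15],
})

# Number of missing values in each column.
missing_per_column = df_demo.isnull().sum()

# Number of missing values in the whole DataFrame.
total_missing = int(missing_per_column.sum())

# Share of missing values per column, as a percentage.
# The mean of a boolean column is the fraction of True values.
missing_share = df_demo.isnull().mean() * 100

print(missing_per_column)
print(total_missing)
print(missing_share)
```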
Drop Rows with Missing Values
Use dropna() to remove rows containing missing values from the DataFrame.
# Drop rows that contain any missing value (modifies df in place)
df.dropna(inplace=True)
# or, equivalently, assign the result back
df = df.dropna()
View DataFrame After Dropping Rows
Display the DataFrame after removing rows with missing values.
df
| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 6.0 | 11.0 |
| 4 | 5.0 | 9.0 | 15.0 |
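By default dropna() removes a row if any value is missing, which here discards three of five rows. The how and subset parameters give finer control. The sketch below rebuilds the original data under the illustrative name df_demo to compare the options.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 7, 8, 9],
    'C': [11, 12, 13, np.nan, 15],
})

# Default (how='any'): drop a row if at least one value is missing.
rows_any = df_demo.dropna()

# how='all': drop a row only if every value in it is missing.
rows_all = df_demo.dropna(how='all')

# subset: only look at the listed columns when deciding what to drop.
rows_subset = df_demo.dropna(subset=['A'])

print(rows_any.index.tolist())
print(rows_all.index.tolist())
print(rows_subset.index.tolist())
```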
Reset Index After Dropping Rows
Use reset_index(drop=True) to renumber the index from 0 after dropping rows; drop=True discards the old index instead of adding it as a column.
df.reset_index(drop=True)
| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 6.0 | 11.0 |
| 1 | 5.0 | 9.0 | 15.0 |
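Note that reset_index() returns a new DataFrame rather than changing the existing one, so the result has to be assigned to take effect. A minimal sketch with a small illustrative frame:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, np.nan, 5.0]}).dropna()
index_after_drop = df_demo.index.tolist()

# Calling reset_index without assignment leaves df_demo untouched.
df_demo.reset_index(drop=True)
index_without_assignment = df_demo.index.tolist()

# Assigning the result back keeps the renumbered index.
df_demo = df_demo.reset_index(drop=True)
index_with_assignment = df_demo.index.tolist()

print(index_after_drop, index_without_assignment, index_with_assignment)
```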
Create Another DataFrame with Missing Values
This section creates a new DataFrame with missing values for further operations.
data1 = {'A': [1,2,3,4,5],
'B': [6,np.nan,7,8,9],
'C': [11,12,13,np.nan,15]
}
df1 = pd.DataFrame(data1)
df1
| | A | B | C |
|---|---|---|---|
| 0 | 1 | 6.0 | 11.0 |
| 1 | 2 | NaN | 12.0 |
| 2 | 3 | 7.0 | 13.0 |
| 3 | 4 | 8.0 | NaN |
| 4 | 5 | 9.0 | 15.0 |
Drop Columns with Missing Values
Use dropna(axis=1) to remove columns containing missing values from the DataFrame.
df1 = df1.dropna(axis=1)
df1
| | A |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
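Dropping columns can remove a lot of data, so it may help to list the affected columns first. The sketch below assumes the same data as df1, rebuilt under the illustrative name df_demo.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, np.nan, 7, 8, 9],
    'C': [11, 12, 13, np.nan, 15],
})

# Columns that contain at least one missing value.
cols_with_missing = df_demo.columns[df_demo.isnull().any()].tolist()

# Columns that survive dropna(axis=1).
cols_kept = df_demo.dropna(axis=1).columns.tolist()

print(cols_with_missing)
print(cols_kept)
```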
Create DataFrame for Threshold Example
This section creates a DataFrame to demonstrate dropping rows based on a threshold of non-missing values.
data2 = {'A': [1,2,3,4,5],
'B': [6,np.nan,7,np.nan,9],
'C': [11,12,13,np.nan,15]
}
df2 = pd.DataFrame(data2)
df2
| | A | B | C |
|---|---|---|---|
| 0 | 1 | 6.0 | 11.0 |
| 1 | 2 | NaN | 12.0 |
| 2 | 3 | 7.0 | 13.0 |
| 3 | 4 | NaN | NaN |
| 4 | 5 | 9.0 | 15.0 |
Drop Rows Based on Threshold
Use dropna(thresh=2) to keep only rows with at least 2 non-missing values.
df2 = df2.dropna(thresh=2)
df2
| | A | B | C |
|---|---|---|---|
| 0 | 1 | 6.0 | 11.0 |
| 1 | 2 | NaN | 12.0 |
| 2 | 3 | 7.0 | 13.0 |
| 4 | 5 | 9.0 | 15.0 |
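To see why only row 3 is removed, it can help to count the non-missing values per row directly. The sketch below assumes the same data as df2, rebuilt as df_demo, and compares dropna(thresh=2) with an explicit boolean mask.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, np.nan, 7, np.nan, 9],
    'C': [11, 12, 13, np.nan, 15],
})

# Count of non-missing values in each row.
non_missing = df_demo.notna().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-missing values ...
by_thresh = df_demo.dropna(thresh=2)

# ... which matches filtering on the count directly.
by_mask = df_demo[non_missing >= 2]

print(non_missing.tolist())
print(by_thresh.equals(by_mask))
```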
Fill Missing Values with Zero
Use fillna(0) to replace all missing values in the DataFrame with zero.
df2 = df2.fillna(0)
df2
| | A | B | C |
|---|---|---|---|
| 0 | 1 | 6.0 | 11.0 |
| 1 | 2 | 0.0 | 12.0 |
| 2 | 3 | 7.0 | 13.0 |
| 4 | 5 | 9.0 | 15.0 |
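A single fill value is not always appropriate for every column. fillna() also accepts a dictionary that maps column names to fill values. The sketch below uses illustrative data and fill values (0 and -1) chosen only for demonstration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, np.nan, 7, np.nan, 9],
    'C': [11, 12, 13, np.nan, 15],
})

# Use a different replacement value for each column.
filled = df_demo.fillna({'B': 0, 'C': -1})

print(filled)
```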
Create DataFrame for Fill Methods
This section creates a DataFrame to demonstrate different methods for filling missing values.
data3 = {'A': [1,2,3,4,5],
'B': [6,np.nan,7,np.nan,9],
'C': [11,12,13,np.nan,15]
}
df3 = pd.DataFrame(data3)
df3
| | A | B | C |
|---|---|---|---|
| 0 | 1 | 6.0 | 11.0 |
| 1 | 2 | NaN | 12.0 |
| 2 | 3 | 7.0 | 13.0 |
| 3 | 4 | NaN | NaN |
| 4 | 5 | 9.0 | 15.0 |
Fill Missing Values with Mean or Median
Use fillna(df3.mean()) or fillna(df3.median()) to replace missing values with the mean or median of each column. A notebook cell displays only its last expression, so the table below shows the median fill.
df3.fillna(df3.mean())
df3.fillna(df3.median())
| | A | B | C |
|---|---|---|---|
| 0 | 1 | 6.0 | 11.0 |
| 1 | 2 | 7.0 | 12.0 |
| 2 | 3 | 7.0 | 13.0 |
| 3 | 4 | 7.0 | 12.5 |
| 4 | 5 | 9.0 | 15.0 |
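The mean and median fills give different values for this data. The sketch below assumes the same data as df3, rebuilt as df_demo, and stores both results so they can be compared side by side.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, np.nan, 7, np.nan, 9],
    'C': [11, 12, 13, np.nan, 15],
})

# Each column's missing values are replaced by that column's statistic.
mean_filled = df_demo.fillna(df_demo.mean())
median_filled = df_demo.fillna(df_demo.median())

# Compare the value filled into column C, row 3.
print(mean_filled.loc[3, 'C'], median_filled.loc[3, 'C'])
```

Neither call modifies df_demo; assign the result to a variable to keep it.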
Fill Missing Values with Forward/Backward Fill
Use ffill() for forward fill and bfill() for backward fill to propagate neighboring non-missing values into the gaps. The older fillna(method='ffill') and fillna(method='bfill') spellings are deprecated in recent pandas versions and emit a FutureWarning. The table below shows the backward-fill result, since it is the last expression in the cell.
df3.ffill()
df3.bfill()
| | A | B | C |
|---|---|---|---|
| 0 | 1 | 6.0 | 11.0 |
| 1 | 2 | 7.0 | 12.0 |
| 2 | 3 | 7.0 | 13.0 |
| 3 | 4 | 9.0 | 15.0 |
| 4 | 5 | 9.0 | 15.0 |
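Forward and backward fill behave differently at the edges of a column, and the limit parameter caps how far a value is propagated. The sketch below uses a small illustrative frame with a leading gap; it is not taken from the original notebook.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'B': [np.nan, 6, np.nan, np.nan, 9]})

# Forward fill copies the previous valid value downward.
# The leading NaN stays, because nothing precedes it.
forward = df_demo.ffill()

# Backward fill copies the next valid value upward.
backward = df_demo.bfill()

# limit=1 fills at most one consecutive missing value per gap.
limited = df_demo.ffill(limit=1)

print(forward['B'].tolist())
print(backward['B'].tolist())
print(limited['B'].tolist())
```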